import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")
df = pd.read_csv('hotel_bookings.csv')
df.head(5)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 119390 non-null object
1 is_canceled 119390 non-null int64
2 lead_time 119390 non-null int64
3 arrival_date_year 119390 non-null int64
4 arrival_date_month 119390 non-null object
5 arrival_date_week_number 119390 non-null int64
6 arrival_date_day_of_month 119390 non-null int64
7 stays_in_weekend_nights 119390 non-null int64
8 stays_in_week_nights 119390 non-null int64
9 adults 119390 non-null int64
10 children 119386 non-null float64
11 babies 119390 non-null int64
12 meal 119390 non-null object
13 country 118902 non-null object
14 market_segment 119390 non-null object
15 distribution_channel 119390 non-null object
16 is_repeated_guest 119390 non-null int64
17 previous_cancellations 119390 non-null int64
18 previous_bookings_not_canceled 119390 non-null int64
19 reserved_room_type 119390 non-null object
20 assigned_room_type 119390 non-null object
21 booking_changes 119390 non-null int64
22 deposit_type 119390 non-null object
23 agent 103050 non-null float64
24 company 6797 non-null float64
25 days_in_waiting_list 119390 non-null int64
26 customer_type 119390 non-null object
27 adr 119390 non-null float64
28 required_car_parking_spaces 119390 non-null int64
29 total_of_special_requests 119390 non-null int64
30 reservation_status 119390 non-null object
31 reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

df.isnull().sum().sort_values(ascending=False)
df.describe()
df.hist(figsize=(20,14))
plt.show()
nan_replacements = {"country": "Unknown", "agent": 0, "company": 0}
df = df.fillna(nan_replacements)
df["meal"].replace("Undefined", "SC", inplace=True)
nan_value = float("NaN")
df.replace(nan_value, 0.0,  inplace=True)
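The dict form of fillna applies a different replacement per column, which is what the cleaning step above relies on; a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["PRT", np.nan],
    "agent": [9.0, np.nan],
    "company": [np.nan, 40.0],
})
# A dict passed to fillna selects the replacement value per column.
filled = toy.fillna({"country": "Unknown", "agent": 0, "company": 0})
print(filled["country"].tolist())  # ['PRT', 'Unknown']
```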

# Some rows contain entries with 0 adults, 0 children and 0 babies.
zero_guests = df.loc[df["adults"]
                     + df["children"]
                     + df["babies"] == 0].index
df.drop(zero_guests, inplace=True)
df.drop_duplicates(inplace=True)
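The zero-guest filter above can be sketched on toy data with a boolean mask:

```python
import pandas as pd

toy = pd.DataFrame({"adults": [2, 0, 1],
                    "children": [0, 0, 0],
                    "babies": [0, 0, 1]})
# Keep only bookings with at least one guest in the party.
kept = toy[toy[["adults", "children", "babies"]].sum(axis=1) > 0]
print(len(kept))  # 2
```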
plt.figure(figsize = (16,16))
ax = sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
target_correlation = df.corr(numeric_only=True)['is_canceled'].sort_values(ascending=False)
target_correlation
df = df.drop(['reservation_status_date'],axis=1)
num_features = ['lead_time','arrival_date_year','arrival_date_week_number','stays_in_weekend_nights',
               'stays_in_week_nights','adults','children','babies','is_repeated_guest','previous_cancellations',
                'previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces',
                'total_of_special_requests']
cat_features = ['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',
                'distribution_channel','reserved_room_type','assigned_room_type','deposit_type','agent',
                'company','customer_type','reservation_status']
features = num_features + cat_features
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
cat_data = df[cat_features].copy()
for col in cat_data:
    cat_data[col] = le.fit_transform(cat_data[col])
cat_data
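Note that refitting one LabelEncoder across all columns discards the previous column's mapping; keeping a fitted encoder per column preserves the ability to invert the codes later. A small sketch on toy columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
                    "meal": ["BB", "SC", "BB"]})
encoders = {}
for col in toy.columns:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    encoders[col] = le  # keep each fitted encoder so codes can be inverted
print(toy["hotel"].tolist())  # [0, 1, 0]
print(encoders["hotel"].inverse_transform([0, 1]))
```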
num_data = df[num_features].copy()
num_data['children'] = num_data['children'].astype('int')
X = pd.concat([cat_data, num_data], axis = 1)
y = df['is_canceled']
print(X.shape,y.shape)
(87230, 29) (87230,)

X.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
rf_model_enh = RandomForestClassifier(n_estimators=160,
                               max_features=0.4,
                               min_samples_split=2,
                               n_jobs=-1,
                               random_state=0)
rf_model_enh.fit(X_train, y_train)
y_hat = rf_model_enh.predict(X_test)
rf_model_enh.score(X_test, y_test)
1.0
print(y_hat[y_hat != y_test].size)
0

Wow, what an amazing model! :)
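A perfect hold-out score is usually a sign of target leakage rather than a great model: reservation_status records whether the booking ended up canceled, so it effectively encodes the label. A minimal synthetic sketch (not the hotel data) of how a leaked column dominates a random forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
# Column 0 is a copy of the target -- the same kind of leak as reservation_status.
X = np.column_stack([y.astype(float), rng.normal(size=(500, 3))])
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# The leaked column soaks up almost all of the impurity-based importance.
print(rf.feature_importances_)
```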

HW 2 Tasks

TO-DO:

  • For the selected observation from the dataset, calculate the model prediction.
  • For the selected observation from point 1, calculate the model prediction decomposition using LIME (packages in R: live, lime, localModel, iml).
  • Compare LIME decomposition for different observations in the set. How stable are the received explanations?
  • Comment on the individual results obtained in the above paragraphs.

Step 1: Prediction calculation

I've chosen the observation below for the prediction and explanation:

i1 = 5252
X_test.iloc[[i1]]
print("Prediction:", rf_model_enh.predict(X_test.iloc[[i1]]))
print('True value:', y_test.iloc[i1])
Prediction: [0]
True value: 0

As expected, the prediction was correct.

Step 2: model prediction decomposition calculation

Let us create an explainer object: since we are dealing with a classification task, the mode is set to classification. For the class names I chose Not Canceled for 0 and Is Canceled for 1.

from lime.lime_tabular import LimeTabularExplainer 

explainer = LimeTabularExplainer(X_train.values, 
                                mode='classification',
                                feature_names=X_train.columns,  
                                class_names=['Not Canceled', 'Is Canceled'],
                                verbose=True,
                                random_state=21)

Now it's time to make the first explanation. However, we need to modify the prediction function first, as the default one is not compatible with the lime explainer. I've limited the number of features to 15 after a couple of executions with different parameters and examining their results.
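As a sanity check, predict_proba returns one row per sample with one probability per class, which is the shape the LIME explainer expects from its prediction function; a small sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
# LIME expects a function mapping an array of samples to per-class probabilities.
predict_fn = lambda x: rf.predict_proba(x).astype(float)
probs = predict_fn(X[:5])
print(probs.shape)  # (5, 2)
```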

predict_fn_rf = lambda x: rf_model_enh.predict_proba(x).astype(float)
explanation = explainer.explain_instance(X_test.iloc[[i1]].values[0], predict_fn_rf, num_features=15)

explanation.show_in_notebook()
Intercept 0.9694310575406664
Prediction_local [0.00536495]
Right: 0.0

We observe that most variables have little or no impact on the target. On the left we see the prediction probabilities; in the middle are the weights, i.e. the magnitude of each variable's influence, in descending order. On the right we see whether each variable pushes the prediction towards 1 (orange) or 0 (blue). The reservation_status variable has the strongest negative influence, deposit_type and previous_cancellations have a slight negative impact, and required_car_parking_spaces has some positive impact.
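Under the hood, LIME perturbs the instance, queries the black box, and fits a proximity-weighted linear surrogate; the weights discussed above are that surrogate's coefficients. A rough sketch of the idea on synthetic data (not the lime package internals):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

x0 = np.zeros(4)  # a point on the decision boundary
# Sample perturbations around x0, weight them by proximity to x0,
# and fit a weighted linear surrogate to the black-box probabilities.
Z = x0 + rng.normal(scale=0.5, size=(500, 4))
p = rf.predict_proba(Z)[:, 1]
w = np.exp(-np.sum((Z - x0) ** 2, axis=1))
surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
# Feature 0 drives the model locally, so it should dominate the coefficients.
print(surrogate.coef_)
```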


plt.rcParams["figure.figsize"] = (16,35)
with plt.style.context('seaborn'):
    explanation.as_pyplot_figure()

On the plot we see the same results, but in a form that is easier to read.

Step 3: LIME Decomposition comparison

I chose two additional observations to compare their explanations to the first one.

i2 = 2525
X_test.iloc[[i2]]
print("Prediction:", rf_model_enh.predict(X_test.iloc[[i2]]))
print('True value:', y_test.iloc[i2])
Prediction: [0]
True value: 0

explanation2 = explainer.explain_instance(X_test.iloc[[i2]].values[0], predict_fn_rf, num_features=15)

explanation2.show_in_notebook()
Intercept 1.0233204503822904
Prediction_local [-0.03758862]
Right: 0.0

Here we can notice that required_car_parking_spaces has a negative impact, and the weights of the deposit_type and previous_cancellations variables have changed. days_in_waiting_list also carries some weight here, while in the first example it did not.

with plt.style.context('seaborn'):
    explanation2.as_pyplot_figure()
i3 = 5225
X_test.iloc[[i3]]
print("Prediction:", rf_model_enh.predict(X_test.iloc[[i3]]))
print('True value:', y_test.iloc[i3])
Prediction: [1]
True value: 1

In this case, the model predicts that the booking is going to be canceled, so it is interesting to take a look at the explanation of such a prediction.

explanation3 = explainer.explain_instance(X_test.iloc[[i3]].values[0], predict_fn_rf, num_features=15)

explanation3.show_in_notebook()
Intercept -0.0037589793898292756
Prediction_local [0.97196899]
Right: 1.0

Again, nothing comes close to the impact of reservation_status. Here it has the value 0, so its impact is positive. We can also notice that required_car_parking_spaces has a positive impact. Interestingly, in this case the babies variable has more influence than the deposit_type and previous_cancellations variables. Moreover, many more variables have some impact on the target (non-zero weights in the LIME decomposition).

with plt.style.context('seaborn'):
    explanation3.as_pyplot_figure()

Conclusions:

As expected, reservation_status is the most important variable, which is easy to understand even without any complex prediction model. The interesting part of the results is how the number of required parking spaces affects the prediction, as to me this is definitely not evident. I believe that previous cancellations and days spent on the waiting list are factors that hotels would also take into account when they receive a booking request.

When it comes to explanation stability, we see that some variables can have a different impact on the prediction even with the same values (previous cancellations and days on the waiting list in cases 1 and 2); however, I think this might be caused by the change in the value of the required_car_parking_spaces variable. There are no significant changes in those weights simply because of the dominance of reservation_status.

I also want to say that I'm really surprised by how well the model predicted the observations this time; to me the results are almost suspicious, so I might take one more, closer look at the process later.

HW 2 was really interesting and I definitely learned a lot in the process, so thank you and have a great day :)
